In this lab we will be critically engaging with existing datasets that have been used to address ethics in AI. In particular, we will explore the Jigsaw Toxic Comment Classification Challenge. This challenge brought to light bias in the data that sparked the Jigsaw Unintended Bias in Toxicity Classification Challenge.
In this lab, we will dig into the dataset ourselves to explore the biases. We will further explore other datasets to expand our thinking about bias and fairness in AI in relation to aspects such as demography and equal opportunity as well as performance and group unawareness of the model. We will learn more about that in the tutorial below.
This week's coding activity will be minimal, if any. However, as always, you will be expected to incorporate your analysis, thoughts and discussions into your notebooks as markdown cells, so I recommend you start up your Jupyter notebook in advance. As always, remember:
University of Glasgow OneDrive should be visible in the home directory of the Jupyter Notebook. Other machines may require additional set-up and/or navigation for OneDrive to be directly accessible from Jupyter Notebook.

This week we will make use of one of the Kaggle tutorials and its associated notebooks to learn how to identify different types of bias. Bias can creep in at any stage of an AI task: from the data collection methods, to how we split/organise the test set, to the choice of algorithm, to how the results are interpreted and deployed. Some of these topics have been extensively discussed and, in response, Kaggle has developed a course on AI ethics:
Read through the first page of the [Kaggle tutorial on Identifying Bias in AI] to understand the scope of biases discussed at Kaggle.
How many types of biases are described on the page?
Which type of bias did you know about already before this course and which type was new to you?
Can you think of any others? Create a markdown cell below to discuss your thoughts on these questions.
Note that the biases discussed in the tutorial are not an exhaustive list. Recall that biases can exist across the entire machine learning pipeline.
Modify the markdown cell below to address Tasks 2-a and 2-b.
Markdown for discussing bias
If there is inherent bias in the input data, it's likely to show in the algorithm's output decisions.
Sampling bias arises from how the data used to develop a machine learning model is collected. Over- or under-sampling of certain groups can lead to outputs that are biased towards a particular demographic.
Algorithm bias arises from the choice of algorithm used to develop a machine learning model. There are many to choose from, such as linear regression or decision trees, and each encodes different assumptions.
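As a toy illustration of sampling bias, we can compare group proportions in a collected sample against the population we think it should represent. The data below is invented purely for illustration:

```python
import pandas as pd

# Hypothetical sample of comment authors, invented for illustration:
# 80 from group A and 20 from group B, against an assumed 50/50 population.
df = pd.DataFrame({"group": ["A"] * 80 + ["B"] * 20})

# Proportion of each group actually present in the sample.
sample_share = df["group"].value_counts(normalize=True)
print(sample_share)
# Group B is under-represented relative to the assumed population,
# a simple sign of sampling bias in the collection process.
```

A check like this is only as good as the reference proportions you compare against, which may themselves be hard to establish.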
Go to the embedding projector at tensorflow.org. This may take some time to load, so be patient! There is a lot of information being visualised, and it will take especially long if you select "Word2Vec All" as your dataset. The projector provides a visualisation of the language model called Word2Vec.
This tool also provides the option of visualising the organisation of hand written digits from the MNIST dataset to see how data representations of the digits are clustered together or not. There is also the option of visualising the iris dataset from scikit-learn with respect to their categories. Feel free to explore these as well if you like.
For the current exercise, we will concentrate on exploring the relationships between the words in the Word2Vec model. First, select Word2Vec 10K from the drop-down menu (top left-hand side); this is a reduced version of Word2Vec All. You can search for words by submitting them in the search box on the right-hand side.
Search for apple and click on "Isolate 101 points"; this reduces the noise. Note how juice, fruit and wine sit closer together than macintosh, computers and atari. Try other words, such as silver and sound. What are your observations? Does it seem like words related to each other sit closer together? Now try engineer, drummer or any other occupation. What do you find? Modify the markdown cell below to present your thoughts.
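For intuition about what "closer" means in the projector: similarity between word embeddings is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions, and these numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy embeddings, invented for illustration only.
vectors = {
    "apple": np.array([0.9, 0.8, 0.1]),
    "juice": np.array([0.8, 0.9, 0.2]),
    "atari": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(vectors["apple"], vectors["juice"]))  # high: related words
print(cosine_similarity(vectors["apple"], vectors["atari"]))  # lower: unrelated words
```

The projector's nearest-neighbour list for a word is essentially a ranking by this kind of similarity score.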
Markdown cell for discussing large language models
| Word | Count | Example related words |
| --- | --- | --- |
| engineer | 607 | architect, technology, science, electrical, mechanical |
| drummer | 274 | musician, pianist, jazz, bands, percussion |
from IPython.display import Image
Image("engineer.png")
from IPython.display import Image
Image("drummer.png")
So we now know that AI models (e.g. large language models) can be biased; we saw that with the embedding projector already. In the previous exercise we discussed the machine learning pipeline and how the assessment of datasets can be crucial to deciding the suitability of deploying AI in the real world. This is where data connects to questions of fairness.
Read through the page to understand the scope of the fairness criteria discussed at Kaggle. Just as we discussed with bias, the fairness criteria discussed at Kaggle are not exhaustive.
How many criteria are described on the page?
Which criteria did you know about already before this course, and which, if any, were new to you?
Can you think of any other criteria? Create a markdown cell and note down your discussion with your peer group on these questions.
Scroll down to the end of the page on AI fairness to find a link to another interactive exercise to run code in a notebook using credit card application data.
Report the results of the activity and discussion by modifying the markdown cell below.
Markdown cell for discussing fairness
from IPython.display import Image
Image("varieties-of-fairness-1.png")
from IPython.display import Image
Image("model.png")
from IPython.display import Image
Image("baseline-model.png")
from IPython.display import Image
Image("group-unaware-model.png")
from IPython.display import Image
Image("varieties-of-fairness-2.png")
from IPython.display import Image
Image("evaluated-model.png")
from IPython.display import Image
Image("final-model.png")
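The criteria compared in the credit card exercise above (demographic parity, equal opportunity, group unawareness) can be computed directly from a model's decisions split by group. A minimal sketch with invented labels and predictions, not the Kaggle data:

```python
import numpy as np

# Invented outcomes (1 = should be approved) and model decisions for two groups.
y_true = {"A": np.array([1, 1, 0, 0, 1]), "B": np.array([1, 0, 0, 1, 0])}
y_pred = {"A": np.array([1, 1, 1, 0, 1]), "B": np.array([1, 0, 0, 0, 0])}

# Demographic parity compares approval rates across groups.
approval_rate = {g: y_pred[g].mean() for g in y_pred}
# Equal opportunity compares true positive rates across groups.
tpr = {g: y_pred[g][y_true[g] == 1].mean() for g in y_pred}

for g in ("A", "B"):
    print(f"group {g}: approval rate {approval_rate[g]:.2f}, "
          f"true positive rate {tpr[g]:.2f}")
```

A group-unaware model simply drops the group column from the inputs; as the exercise shows, that alone does not guarantee either of the rates above will be equal across groups.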
In this section we will explore the reasons behind the decisions that AI makes. While this is genuinely hard to know, approaches have been developed to identify which features in your data (e.g. median_income in the housing dataset we used before) played a more important role than others in determining how your machine learning model performs. One of the many approaches for assessing feature importance is permutation importance.
The idea behind permutation importance is simple: shuffle the values of a single feature, measure how much the model's performance drops, and treat a bigger drop as evidence of a more important feature. Features are what you might consider the columns in a tabulated dataset, such as one you might find in a spreadsheet.
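The idea can be sketched in a few lines. The data below is synthetic and invented for illustration (the Kaggle exercise itself uses the eli5 library, but the shuffling logic is the same):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y depends only on the first of two features.
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
baseline = model.score(X, y)  # R^2 score before any shuffling

drops = {}
for col, name in enumerate(["informative", "noise"]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])  # break this feature's relationship with y
    drops[name] = baseline - model.score(X_shuffled, y)
    print(f"{name}: score drop {drops[name]:.3f}")
```

Shuffling the informative column destroys most of the model's predictive power, while shuffling the noise column barely changes the score; in practice the shuffle is repeated several times and the drops averaged.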
To make this idea more concrete, read through the page at the Tutorial on Permutation Importance at Kaggle. The page describes an example to "predict a person's height when they become 20 years old, using data that is available at age 10".
The page invites you to work with code to calculate the permutation importance of features for an example in football to predict "whether a soccer/football team will have the "Man of the Game" winner based on the team's statistics". Scroll down to the end of the page to the section "Your Turn" where you will find a link to an exercise to try it yourself to calculate the importance of features in a Taxi Fare Prediction dataset.
Exercise Results:
Question One Solution - It would be helpful to know whether New York City taxis vary prices based on how many passengers they have. Most places do not change fares based on numbers of passengers. If you assume New York City is the same, then only the top 4 features listed should matter. At first glance, it seems all of those should matter equally.
Question Two Solution
from IPython.display import Image
Image("q2.png")
Question Three Solution:
1. Travel might tend to have greater latitude distances than longitude distances. If the longitude values were generally closer together, shuffling them wouldn't matter as much.
2. Different parts of the city might have different pricing rules (e.g. price per mile), and pricing rules could vary more by latitude than by longitude.
3. Tolls might be greater on roads going north-south (changing latitude) than on roads going east-west (changing longitude). Latitude would then have a larger effect on the prediction because it captures the amount of the tolls.
Question Four Solution
from IPython.display import Image
Image("q4.png")
Question Five Solution - The scale of features does not affect permutation importance per se. The only way rescaling a feature would affect permutation importance is indirect: if rescaling helped or hurt the ability of the particular learning method to make use of that feature. That won't happen with tree-based models, like the Random Forest used here. If you are familiar with Ridge Regression, you might be able to think of how that would be affected. That said, the absolute-change features have high importance because they capture total distance travelled, which is the primary determinant of taxi fares; it is not an artefact of the feature magnitude.
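The claim that rescaling a feature does not change permutation importance for tree-based models is easy to sanity-check on synthetic data (invented for illustration, not the Kaggle taxi data), using scikit-learn's `permutation_importance`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2 * X[:, 0] + X[:, 1]  # first feature matters more

def importances(features):
    """Fit a forest and return the mean permutation importance per column."""
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(features, y)
    result = permutation_importance(model, features, y, n_repeats=5, random_state=0)
    return result.importances_mean

X_scaled = X.copy()
X_scaled[:, 0] *= 1000  # rescale the first feature by a large factor

imp_raw = importances(X)
imp_scaled = importances(X_scaled)
print(imp_raw)
print(imp_scaled)  # tree splits adapt to the new scale, so importances barely move
```

The split thresholds inside the forest simply scale along with the feature, so the partitions of the data, and hence the permutation importances, stay essentially the same.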
Question Six Solution - We cannot tell from the permutation importance results whether travelling a fixed latitudinal distance is more or less expensive than travelling the same longitudinal distance. Possible reasons the latitude features are more important than the longitude features:
1. latitudinal distances in the dataset tend to be larger
2. it is more expensive to travel a fixed latitudinal distance
3. both of the above

If abs_lon_change values were very small, longitudes could be less important to the model even if the cost per mile of travel in that direction were high.
Permutation importance is a reasonable measure of feature importance in AI, as it evaluates the impact of randomly shuffling a feature's values on a model's performance.
One issue is that it may not capture interactions between features, as it assesses each feature in isolation. It can also be expensive, especially for models with large numbers of features, as the model must be re-evaluated for each feature.
Apart from the Jigsaw Toxic Comment Classification Challenge another challenge you might explore is the Inclusive Images Challenge. Read at least one of the following.
There are many concepts (e.g. model cards and datasheets) omitted from the discussion of AI and ethics above. To acquire a foundational knowledge of transparency, accessibility and fairness:
In this lab, you explored a number of areas that pose challenges with regard to AI and ethics: bias, fairness and explainability. These topics, along with others in responsible AI development, are currently at the forefront of the AI landscape.
The discussions coming up in the lectures on applications of AI (to be presented by guest lecturers in the weeks to come) will undoubtedly intersect with these concerns. In preparation, you might think, in advance, about what distinctive questions about ethics might arise in AI applications in law, language, finance, archives, generative AI and beyond.